ABSTRACT

Plants have large diverse families of small secreted 
proteins (SSPs) that play critical roles in the 
processes of development, differentiation, defense, 
flowering, stress response, symbiosis, etc. Oryza 
sativa is one of the major crops worldwide and an 
excellent model for monocotyledonous plants. 
However, there has been no effort to systematically
analyze rice SSPs. Here, we constructed a
comparative platform, OrysPSSP
(http://www.genoportal.org/PSSP/index.do), involving
>100 000 SSPs from rice and 25 other plant species.
OrysPSSP is composed of a core SSP database 
and a dynamic web interface that integrates a 
variety of user tools and resources. The current
release (v0530) of the core SSP database contains a
total of 101 048 predicted SSPs, which were
generated through a rigorous computation/curation
pipeline. The web interface consists of eight
different modules, providing users with rich
resources/functions, e.g. browsing SSPs by chromosome,
searching and filtering SSPs, validating SSPs with
omics data, comparing SSPs among multiple
species and querying the core SSP database with
BLAST. Some cases of application are discussed
to demonstrate the utility of OrysPSSP. OrysPSSP 
serves as a comprehensive resource to explore 
SSP on the genome scale and across the phylogeny 
of plant species. 

INTRODUCTION 

It has been known for years that in animals, small secreted
proteins (SSPs), such as peptide hormones, cytokines/
chemokines, digestive enzymes and defensive peptides
(antibody, neurotoxin, defensin), play critical roles in
development, metabolism, reproduction, differentiation,
metamorphosis, predation and other essential aspects of
the life cycle (1-3). Similarly important functions of
SSPs were later discovered in plants, when Pearce
et al. (4) first identified tomato systemin, an 18-aa peptide,
which functions as a signal molecule in the defense-response
cascade. Intensive studies in the following two
decades unraveled the essential roles of diverse SSPs
in plant physiology throughout the life cycle (5-13).

Initial efforts to identify plant SSPs by biochemical
approaches made only limited progress. Identification has
recently been accelerated by the availability of genomic
sequences for an increasing number of plant species,
including Arabidopsis thaliana, Glycine max and Populus
deltoides. To date, attempts have been made to take
advantage of genome annotation to predict SSPs in
A. thaliana (14) and P. deltoides (15), or to profile plant
secretomes with computational methods (16). Although
genomic approaches have greatly expanded the list of SSPs
in plants, many shortfalls and questions remain to be
addressed. First, existing genome annotation programs are
inadequate to annotate all SSPs in plants. As a result, the
numbers of small proteins are grossly underestimated in
many current genome annotations (14). Lease and Walker (14)
tried to recover the missing SSPs from A. thaliana by
scanning its genome for open reading frames (ORFs)
encoding short peptides of between 25 and 250 aa in length.
A total of 33 809 un-annotated SSPs were predicted in
A. thaliana, and 10 247 (30%) were supported by tiling
array data. Using the 'Coding Index' method, Hanada
et al. (16,17) identified 7159 possible SSPs from
A. thaliana, with a claimed false discovery rate of 1%.
However, a separate work by Castellana et al. (18)
suggested a confirmation rate of a mere 2% for the above
A. thaliana studies. Hoping to avoid false-positive results
by starting with transcriptomic data, Yang et al. (15)
obtained an initial set of 12 852 ORFs encoding proteins of
10-200 aa in length from P. deltoides.

Oryza sativa, one of the major crops worldwide and an
excellent model for monocotyledonous plants, has remained
unexplored in the study of SSPs. Because rice is both an
important economic crop and a model plant, our top priority
is to explore its SSPs on the whole-genome scale and compare
them with those of other species across the plant phylogeny.
Here, we present OrysPSSP: a comparative Platform for Small
Secreted Proteins from rice and other plants. In the
current project, we set out to achieve the following goals:

(i) Building an exhaustive database of SSPs from
O. sativa. To make it exhaustive, we created the
initial dataset by combining a six-frame translation
and an algorithm for gene model prediction. A
processing pipeline then filtered out false data in
three steps.

(ii) Building flexible and effective validation tools to 
minimize false discovery rate and enhance usability. 
We integrated three levels of high-throughput 
experimental datasets, including gene expression 
microarray, RNA-seq and tandem mass 
spectrometry (MS), for the validation of predicted 
SSPs. 

(iii) Building a comparative genomics tool for a
comprehensive analysis of the conservation of SSPs in
26 plant species. We integrated the genome infor- 
mation from 25 plant species besides O. sativa ssp. 
japonica. Comparison across the phylogeny would 
yield insight into the occurrence and evolution of 
SSPs in plant species. 

The present work provides the most comprehensive
platform for the study of plant SSPs. Its database not
only contains SSPs from rice (the best model plant) but
also covers SSPs conserved in 25 other plant
species/subspecies. The current official release (v0530)
contains a complete set of 101 048 SSP candidates. About
two-thirds of them, 67 559, are located in un-annotated
genome regions in rice, while the rest, 33 489, are
included in known genes. When validated with datasets at
three different levels, 33 350 SSPs were supported by
tiling array data, 9431 by RNA-seq data and 18 353 by MS
results. When comparing across the phylogeny of 25 plant
species, we found that the number of SSPs conserved between
rice and another plant generally decreased as their
evolutionary distance increased.

DATABASE CONSTRUCTION 

Data source 

For the reference genome of O. sativa ssp. japonica, we 
used IRGSP1.0 from the Rice Annotation Project (RAP, 
http://rapdb.dna.affrc.go.jp/) (19). The annotation of the 
rice genome was updated by a joint effort of RAP and
the MSU Rice Genome Annotation Project
(http://rice.plantbiology.msu.edu/) (20). For comparative genomics
analysis, the genomes of 25 green plant species were 
obtained from sources listed in Table 1 (21-30). 

For validation analysis of predicted SSPs, tiling array
datasets for seedling root, seedling shoot, panicle and
suspension cultured cells of O. sativa ssp. japonica were
obtained from the Gene Expression Omnibus (GEO)
database (GEO Series accession number: GSE6996,
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6996)
(31); RNA-seq datasets from root and tip tissues of O.
sativa ssp. japonica were downloaded from the Sequence
Read Archive (SRA) database (study accession:
SRP007395, http://www.ncbi.nlm.nih.gov/sra?term=SRP007365)
(32); the proteomics datasets for O. sativa
ssp. japonica were retrieved from the PRoteomics
IDEntifications database (PRIDE) (experiment accession:
15854-15865, http://www.ebi.ac.uk/pride/) (33).

Data processing pipeline 

A data pipeline was built to predict and annotate SSPs in
O. sativa ssp. japonica (Figure 1). To predict SSPs, a core
dataset was formed in three steps:

(i) Constructing the starting dataset of small peptides
(25-250 aa in length) by combining data from
whole-genome screening and from gene modeling
using Augustus (v2.5.5) and FGENESH
(Softberry, http://www.softberry.com) (34).
Whole-genome screening was performed by translating
the rice genome in all six reading frames using the
EMBOSS package (35). To recover multiple-exon genes
that may be missed by the six-frame translation
approach, the gene modeling programs Augustus
(v2.5.5) and FGENESH were used to predict
genes, which were then combined with the six-frame
gene set. For Augustus, a rice gene set supported by
full-length cDNAs, expressed sequence tags or
proteins in IRGSP1.0 was used for training; for
FGENESH, the gene model of Z. mays, a close
evolutionary neighbor, was used for gene prediction
in O. sativa ssp. japonica var. Nipponbare.
The two resulting datasets (90 140 ORFs from
six-frame translation and 22 341 from gene modeling)
were combined, and ORFs encoding peptides of
25-250 aa in length were selected for further
analysis.

(ii) Screening for N-terminal signal sequences in the
above merged dataset. We used the stand-alone
software SignalP 4.0 (36), which uses a combination
of artificial neural networks to predict a signal
peptide and its cleavage site. Peptides that have an
N-terminal signal sequence were retained for
further analysis.

(iii) Screening for trans-membrane domains. The above
dataset was filtered to exclude peptides with
trans-membrane helices predicted by TMHMM2.0c (37);
such helices indicate that a protein resides in the
plasma membrane or an endomembrane.
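
The six-frame screening in step (i) can be illustrated with a minimal Python sketch. It is not the EMBOSS implementation used for OrysPSSP; it assumes that an ORF runs from the first methionine to the next in-frame stop codon, a convention the text does not specify, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate_dna(dna):
    """Translate codon by codon; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna, min_len=25, max_len=250):
    """Yield (strand, frame, peptide) for Met-to-stop ORFs of 25-250 aa.

    Assumption: an ORF starts at the first 'M' after the previous stop
    and must be terminated by a stop codon.
    """
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    for strand, seq in strands.items():
        for frame in range(3):
            protein = translate_dna(seq[frame:])
            # The last piece after split() has no stop codon, so drop it.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")
                if start == -1:
                    continue
                peptide = segment[start:]
                if min_len <= len(peptide) <= max_len:
                    yield strand, frame, peptide
```

Multiple-exon genes cannot be recovered this way, which is why the pipeline adds gene models from Augustus and FGENESH.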

Using the pipeline, a total of 101 048 putative ORFs for
SSPs were identified, about one-third of which were novel
and located between known genes.
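
The three filtering steps can be summarized as a simple predicate. The sketch below assumes that SignalP and TMHMM results have already been parsed into per-ORF fields, and that peptides with any predicted trans-membrane helix are discarded; the `Candidate` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    orf_id: str
    length_aa: int            # peptide length in amino acids
    has_signal_peptide: bool  # assumed to be parsed from SignalP output
    n_tm_helices: int         # assumed to be parsed from TMHMM output


def is_putative_ssp(c, min_len=25, max_len=250):
    """Apply the length, signal peptide and trans-membrane filters in order."""
    if not (min_len <= c.length_aa <= max_len):
        return False
    if not c.has_signal_peptide:
        return False
    return c.n_tm_helices == 0


def run_filters(candidates):
    """Return the IDs of candidates that pass all three filters."""
    return [c.orf_id for c in candidates if is_putative_ssp(c)]
```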

To reveal possible functions of these peptides, the
candidates from the core SSP dataset were annotated for
(i) conserved domains and (ii) organelle location.
HMMER (v3.0) was used to scan the Pfam-A database to
identify domains in rice SSPs, and 7755 SSP genes
were found to have one or more domain matches
(38,39). Finally, the organelle location of SSPs was
predicted using TargetP 1.1 (40). Because secreted proteins
have ancient origins in both prokaryotes and eukaryotes,
we reasoned that putative SSPs that are conserved in
evolution are more likely to be true secreted proteins. So,
in addition to the validation tool, we provide an analytic
tool for comparing rice SSPs with those of other plant
species.

Table 1. Data source for genomes of 25 green plant species

Figure 1. Schematic diagram of the data processing pipeline for OrysPSSP.

Database implementation 

The core SSP dataset of O. sativa ssp. japonica var.
Nipponbare generated from the data processing pipeline
was stored in a MySQL database (v5.5,
http://www.mysql.com/) with a relational database design.
The stored data include the genome location, pre-protein
sequence, signal peptide sequence, domain annotation,
targeting organelle and neighboring genes for each SSP,
as well as some validation information. All OrysPSSP
application tools, including 'Browse', 'Search & Validate',
'Compare Genomes' and 'BLAST Search', were built on the
MySQL database. They were implemented in JSP and deployed
on the Apache Tomcat web server (http://tomcat.apache.org/).
OrysPSSP can be accessed through IE 6.0 or higher,
Netscape 7.0 or higher and other web browsers such as
Safari, Opera, Chrome and Firefox.
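
To make the relational design concrete, the following sketch builds a toy table holding the kinds of fields listed above and runs a 'Browse'-style query by chromosome. It uses Python's built-in sqlite3 as a stand-in for MySQL; the table name, column names and rows are all hypothetical and do not reflect the actual OrysPSSP schema or data.

```python
import sqlite3

# Hypothetical single-table schema; the real database may be organized differently.
SCHEMA = """
CREATE TABLE ssp (
    ssp_id              TEXT PRIMARY KEY,
    chromosome          TEXT NOT NULL,
    start_pos           INTEGER NOT NULL,
    end_pos             INTEGER NOT NULL,
    strand              TEXT CHECK (strand IN ('+', '-')),
    preprotein_seq      TEXT,
    signal_peptide_seq  TEXT,
    domain_annotation   TEXT,
    target_organelle    TEXT,
    within_known_gene   INTEGER
)
"""

# Made-up rows for illustration only.
TOY_ROWS = [
    ("toy_chr01_0002", "chr01", 9000, 9300, "-", "MKAAAV", "MKA", None, "S", 0),
    ("toy_chr01_0001", "chr01", 1200, 1500, "+", "MRLLLT", "MRL", "toy_domain", "S", 1),
    ("toy_chr02_0001", "chr02", 500, 800, "+", "MSTTTG", "MST", None, "S", 0),
]


def build_toy_db():
    """Create an in-memory database with the toy rows loaded."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO ssp VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", TOY_ROWS)
    return conn


def browse_by_chromosome(conn, chromosome):
    """Return SSPs on one chromosome, ordered by start position."""
    cur = conn.execute(
        "SELECT ssp_id, start_pos, end_pos, strand FROM ssp "
        "WHERE chromosome = ? ORDER BY start_pos",
        (chromosome,),
    )
    return cur.fetchall()
```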

WEB INTERFACE AND INTEGRATED TOOLS 

The OrysPSSP web interface (Figure 2) consists of eight
different modules, integrating the core SSP database and
various application tools. The eight modules are:

(i) 'Home': a brief introduction to OrysPSSP;

(ii) 'Browse': browse SSPs from O. sativa ssp. japonica by
chromosome;

(iii) 'Search & Validate': search for SSPs using one or
multiple search parameters and validate the resulting SSPs
by applying one or multiple filtering dataset(s);

(iv) 'Compare Genomes': apply a comparative genomics
approach to analyze SSPs from O. sativa ssp. japonica that
are conserved in other plant species;

(v) 'BLAST search': use the 'BLAST' tool to search for SSPs
contained in the query sequences;

(vi) 'Statistics': provide basic statistics from the data
processing pipelines;

(vii) 'Help': offer answers to frequently asked questions
about OrysPSSP; and

(viii) 'Contact us': include contact information of support
for users.

OrysPSSP currently provides users with several flexible
tools and integrates related genome resources to search,
validate, filter and compare the data. Links to a number
of relevant external resources are also provided.

Search and validation tool 

The 'Search & Validate' module provides basic search and
filter functions on the database, including text search
and filtering on chromosome number, strand and/or
annotation.

To add further value to this platform, we also provide
validation functions by integrating three levels of
experimental datasets: (i) at the transcriptional level,
we obtained the tiling array hybridization datasets of
seedling root, seedling shoot, panicle and suspension
cultured cells from O. sativa ssp. japonica; (ii) because
the more sensitive 'RNA-seq' technology can detect
transcripts at low expression levels, we included the
O. sativa ssp. japonica RNA-seq dataset (SRP007395) from
the NCBI SRA database; and (iii) at the translational
level, we added a peptide spectra dataset (experiment
accession: 15854-15865) extracted from the MS of rice
tissues from PRIDE (http://www.ebi.ac.uk/pride/) (33).
Users can select one, two or three levels of data to
perform validation tests on small secreted peptides in
OrysPSSP. The selected levels are combined with a logical
'AND' operation. For validation with the tiling array
expression data, a threshold of twice the median
hybridization intensity value was used for positive
results, and we found evidence for the expression of
18 371 putative small secreted peptides. For validation
with the rice RNA-seq data, bases within an ORF must be
mapped by at least two RNA-seq reads; we identified 3992
putative ORFs supported by RNA-seq. For validation with the
MS data, tryptic peptides from each predicted small
secreted peptide were screened against the MS data with
X!Tandem. We obtained supporting evidence for
16 657 SSPs from rice. Furthermore, a user can easily
combine different search parameters and validation
datasets to obtain a subset of SSPs that meets
his/her research needs.
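
The validation rules described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the OrysPSSP code: the median is assumed to be taken over all hybridization intensities on the array, the threshold comparison is assumed to be inclusive, and the function names and evidence keys are hypothetical.

```python
from statistics import median


def tiling_supported(orf_intensity, array_intensities, fold=2.0):
    """Positive if the ORF signal reaches twice the median array intensity."""
    return orf_intensity >= fold * median(array_intensities)


def rnaseq_supported(n_reads, min_reads=2):
    """Positive if at least two RNA-seq reads map within the ORF."""
    return n_reads >= min_reads


def passes_validation(evidence, selected_levels):
    """Combine the user-selected evidence levels with a logical AND.

    evidence maps a level name (e.g. 'tilingArray', 'RNAseq', 'MS')
    to True or False for one SSP.
    """
    return all(evidence[level] for level in selected_levels)
```

With no level selected, `passes_validation` returns True for every SSP, which corresponds to applying no validation filter.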

Comparative analysis to other plant species 

'Compare Genomes' is an advanced tool module that
applies a comparative genomics approach to search for
SSPs from O. sativa ssp. japonica that are conserved in
25 plant species ranging from moss to angiosperms,
including Aquilegia coerulea, Arabidopsis lyrata,
A. thaliana, Brachypodium distachyon, Capsella rubella,
Carica papaya, Citrus sinensis, Cucumis sativus,
Eucalyptus grandis, G. max, Malus domestica, Manihot
esculenta, Medicago truncatula, Mimulus guttatus,
O. sativa ssp. indica, Phaseolus vulgaris, Physcomitrella
patens, Populus trichocarpa, Ricinus communis,
Selaginella moellendorffii, Setaria italica, Sorghum bicolor,
Thellungiella halophila, Vitis vinifera and Zea mays.
It helps users who are interested in studying the more
conserved SSPs from rice and in looking for model SSPs that
have evolutionary roots in other plant species.

This module requires users to input a list of rice SSPs to
start a comparative search. The list can be readily
generated from the results of other tools or created
manually. The input SSPs are used as queries to search
against the genome sequences of the user-selected species
using BLASTp. Users can select one or multiple species for
the search; multiple species are treated as a logical 'OR'
in the search operation, so the results show those SSPs
that are conserved in any of the selected species.
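
The logical 'OR' across species can be sketched as a set intersection test. The sketch assumes that BLASTp hits have already been reduced to a mapping from each rice SSP to the set of species in which it has a hit; the function name and input structure are hypothetical.

```python
def conserved_in_any(hits_by_ssp, selected_species):
    """Return SSP IDs with a hit in at least one selected species.

    hits_by_ssp maps an SSP ID to the set of species names in which a
    BLASTp hit was found (assumed to be computed beforehand).
    """
    selected = set(selected_species)
    return sorted(
        ssp_id for ssp_id, species in hits_by_ssp.items() if species & selected
    )
```

An SSP with hits in none of the selected species is dropped, even if it is conserved in a species the user did not select.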

BLAST search tool 

A 'BLAST' search tool was integrated into OrysPSSP to
help users search for SSPs (from rice) that map to
their sequences of interest. Users can either enter their
queries into the query box or upload a file that contains
the query sequences. The tool accepts DNA, mRNA or amino
acid sequences as queries. Users can also modify the
common BLAST parameters or leave them at their default
values.

CASES OF APPLICATION 

For plant biologists, OrysPSSP would be a valuable
resource for studying the functions of novel SSPs in
development, signaling, metabolism and reproduction
in a model plant like rice. To illustrate the utility of
OrysPSSP, we here describe two application cases
inspired by our collaborators.

Figure 2. Snapshot of the 'Browse' module from the OrysPSSP web interface, which consists of eight modules: (i) 'Home'; (ii) 'Browse'; (iii) 'Search &
Validate'; (iv) 'Compare Genomes'; (v) 'BLAST search'; (vi) 'Statistics'; (vii) 'Help'; and (viii) 'Contact us'. Details of each module are described
in 'Web interface and integrated tools'.

Mining novel SSPs from O. sativa ssp. japonica 

Despite the progress made in the study of SSPs in plants,
their number and identities remain largely unknown in
rice. To find novel SSPs from rice, we used the 'Search &
Validate' tool, setting the 'Within known gene' option
to 'No' and leaving everything else unchanged. The
validation filter was set by checking one of the three
check boxes: tilingArray, RNAseq or MS. There were 6213
novel SSPs in rice that passed validation by tilingArray,
and 233 and 9788 were validated with RNA-seq data and MS
data, respectively. When tilingArray and MS were combined
(a moderately stringent filter), 3412 novel SSPs were
returned, representing a subset of rice SSPs with relatively
high confidence.

Identification of conserved SSPs from O. sativa 
ssp. japonica 

Diverse families of SSPs have been identified in plants (41).
They include the CLV3/ESR-related (CLE) family, RALF
(rapid alkalinization factor), the EPF (epidermal patterning
factor) family, PDF (plant defensin), DEFL (defensin-like
proteins), CEP1 (C-terminally encoded peptide 1),
SCR/SP11, LUREs, etc. Finding these conserved SSPs in rice
and identifying new members are of great importance to
plant biologists. Using the keyword 'CLE' in the 'Search &
Validate' module (setting the 'Within known gene' option to
'No'), we found a new 'FON2-like CLE protein 2' (ID:
ory_chr06_5621_spd), which had not been known in rice.
Using the 'Compare Genomes' module to compare between
plant species, ory_chr06_5621_spd was found to be
specific to rice. Similarly, seven new members of PDF
(plant defensin) (ory_chr11_2644_spd, ory_chr10_5233_spd,
ory_chr08_3989_spd, ory_chr08_3386_spd,
ory_chr05_4899_spd, ory_chr03_2326_spd and
ory_chr01_6059_spd) and one new member of RALF
(ory_chr11_4864_spd) were discovered in O. sativa ssp.
japonica.

SUMMARY AND FUTURE DEVELOPMENT 

OrysPSSP serves as a comprehensive resource to explore
SSPs from O. sativa and other plant species on the
whole-genome scale. It should benefit investigators
addressing a variety of questions. Geneticists can
query the database for SSPs located within target
regions of interest. For plant biologists, the platform
is a valuable resource for initiating a study on the
functions of novel SSPs in development, signaling or
metabolism in a model plant. Alternatively, they can check
whether some peptides induced by stress belong to the
dataset of un-annotated SSPs. Furthermore, they may
retrieve gene and domain information for those matched
peptides.

Our current study improves on the methods of previous
works. Multiple-exon genes were processed by our new
pipeline, whereas only single-exon genes were included
previously (14,16). In addition, three levels of omics
data were integrated, which gives biologists more
dynamic and flexible tools for validation of SSPs. A
comparative genomics approach was applied to investigate
the diversity of SSPs from an evolutionary perspective.
Still, there are many aspects in which OrysPSSP can be
improved. In the future, we plan to enhance OrysPSSP
by: (i) integrating tissue-specific omics data from rice for
analysis of the function and tissue specificity of SSPs and
(ii) applying our pipeline to more plant species, as well
as including high-throughput omics data for validation.

OrysPSSP is a comprehensive platform to explore the 
full spectrum of SSPs in rice and other plant species. It 
would help advance our understanding of the essential
roles played by SSPs and yield new insights into the processes of
development, differentiation, stress response and symbi- 
osis in plants. 